Abstract

The W3C Voice Browser working group aims to develop specifications to enable access 
to the Web using spoken interaction. This document is part of a set of specifications for 
voice browsers, and provides details of an XML markup language for describing the 
meanings of individual natural language utterances. It is expected to be automatically 
generated by semantic interpreters for use by components that act on the user's 
utterances, such as dialog managers. 

Status of this Document 



This document is a W3C Working Draft for review by W3C members and other 
interested parties. It is a draft document and may be updated, replaced, or obsoleted by 
other documents at any time. It is inappropriate to use W3C Working Drafts as 
reference material or to cite them as other than "work in progress". A list of current 
public W3C Working Drafts can be found at http://www.w3.org/TR.

This specification describes markup for representing natural language semantics, and 
forms part of the proposals for the W3C Speech Interface Framework. This document 
has been produced as part of the W3C Voice Browser Activity, following the procedures
set out for the W3C Process. The authors of this document are members of the Voice
Browser Working Group (W3C Members only). This document is for public review, and
comments and discussion are welcomed on the public mailing list <www-voice@w3.org>.
To subscribe, send an email to <www-voice-request@w3.org> with the word
subscribe in the subject line (include the word unsubscribe if you want to
unsubscribe). The archive for the list is accessible online.



General Issues 

The NL semantics representation uses the data models of the W3C XForms draft 
specification to represent application-specific semantics. While XForms syntax may 
change in future revisions of the specification, it is not expected to change in ways that 
affect the NL Semantics Markup Language significantly. 



1. Introduction

This document presents an XML specification for a Natural Language Semantics 
Markup Language, responding to the requirements documented in W3C Natural 
Language Processing Requirements for Voice Browsers. This markup language is 
intended for use by systems that provide semantic interpretations for a variety of inputs, 
including but not necessarily limited to, speech and natural language text input. These 
systems include Voice Browsers, web browsers and accessible applications. 

It is expected that this markup will be used primarily as a standard data interchange 
format between Voice Browser components; in particular, it will normally be 
automatically generated by a semantic interpretation component to represent the 
semantics of users' utterances and will not be directly authored by developers. 

The language is focused on representing the semantic information of a single utterance, 
as opposed to (possibly identical) information that might have been collected over the 
course of a dialog. See the Future Study section for a detailed discussion of returning 
information from a dialog. 

The language provides a set of elements that are focused on accurately representing 
the semantics of a natural language input. The following are the key design criteria. 

• Fidelity: The representation should be capable of accurately reflecting the user's 
intended meaning in terms of the application's goals. However, it should also 
provide a semantic interpreter with the means to represent vagueness and 
ambiguity when the user's meaning cannot be fully determined with the 
information available to the semantic interpreter. 

• Interoperability: The representation should support use along with other W3C 
specifications including (but not limited to) the Dialog Markup Language, Speech 
Grammar Markup Language, SMIL and XForms.

• Implementability: The required elements of the specification should be
implementable with existing, generally available technology. 

• Extensibility: The specification should be extensible to accommodate emerging
and future capabilities of automatic speech recognizers (ASRs), natural
language interpreters, and voice browsers. For example, it should be compatible
with statistical ASRs, mixed initiative dialogs and multi-modal components.

• Architectural Neutrality: The specification should attempt wherever possible to 
avoid specifications which imply commitments to particular Voice Browser 
architectures, for example whether multi-modal integration takes place before or 
after natural language interpretation. 

• Portability: The specification should be able to support consistent behavior 
across platforms. 



This specification includes a set of draft elements and attributes and includes a draft 
DTD. 



1.1 Uses 

The general purpose of the NL Semantics Markup is to represent information 
automatically extracted from a user's utterances by a semantic interpretation 
component, where utterance is to be taken in the general sense of a meaningful user 
input in any modality supported by the platform. Referring to the sample Voice Browser 
architecture in Introduction and Overview of the W3C Speech Interface Framework, a
specific architecture can take advantage of this representation by using it to convey 
content among various system components that generate and make use of the markup. 

Components that generate NL Semantics Markup: 

1. ASR 

2. Natural language understanding 

3. Other input media interpreters (e.g. DTMF, pointing, keyboard) 

4. Reusable dialog component 

5. Multimedia integration component 

Components that use NL Semantics Markup: 

1. Dialog manager

2. Multimedia integration component 

A platform may also choose to use this general format as the basis of a general 
semantic result that is carried along and filled out during each stage of processing. In
addition, future systems may also potentially make use of this markup to convey 
abstract semantic content to be rendered into natural language by a natural language 
generation component. 

1.2 Markup Functions

A semantic interpretation system that supports the Natural Language Semantics 
Markup Language is responsible for interpreting natural language inputs and formatting 
the interpretation as defined in this document. Semantic interpretation is typically either 
included as part of the speech recognition process, or involves one or more additional 
components, such as natural language interpretation components and dialog 
interpretation components. See the Voice Browser Architecture described in
http://www.w3.org/TR/voice-intro/ for a sample architecture.

The elements of the markup fall into the following general functional categories:

Input formats and ASR information:

The "input" element, representing the input to the semantic interpreter.

Interpretation:

Elements and attributes representing the semantics of the user's utterance, including
the "result", "interpretation", "model", and "instance" elements. The "result" element
contains the full result of processing one utterance. It may contain multiple 
"interpretation" elements if the interpretation of the utterance results in multiple 
alternative meanings due to uncertainty in speech recognition or natural language 
understanding. There are at least two reasons for providing multiple interpretations: 

1. another component, such as a dialog manager, might have additional 
information, for example, information from a database, that would allow it to 
select a preferred interpretation from among the possible interpretations returned 
from the semantic interpreter. 

2. a dialog manager that was unable to select between several competing 
interpretations could use this information to go back to the user and find out what 
was intended. For example, Did you say "Boston" or "Austin"? 
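
The two uses of multiple interpretations described above can be sketched in Python. This is only an illustration of how a consuming component might behave, not part of the markup definition. The function names, the set of served cities and the sample document are assumptions made for the example, and the application elements are left unprefixed because this draft does not give a namespace URI for "myApp".

```python
import xml.etree.ElementTree as ET

XF = "http://www.w3.org/2000/xforms"  # XForms namespace URI used in this draft's examples

# Hypothetical result carrying two competing interpretations of one utterance.
SAMPLE = """
<result xmlns:xf="http://www.w3.org/2000/xforms" grammar="http://flight">
  <interpretation confidence="60">
    <input mode="speech">I want to go to Boston</input>
    <xf:instance><airline><to_city>Boston</to_city></airline></xf:instance>
  </interpretation>
  <interpretation confidence="40">
    <input mode="speech">I want to go to Austin</input>
    <xf:instance><airline><to_city>Austin</to_city></airline></xf:instance>
  </interpretation>
</result>
"""


def candidate_cities(result_xml):
    """Return the to_city value of each interpretation, in document order."""
    root = ET.fromstring(result_xml)
    cities = []
    for interp in root.findall("interpretation"):
        city = interp.find(f"{{{XF}}}instance/airline/to_city")
        if city is not None and city.text:
            cities.append(city.text.strip())
    return cities


def choose_or_ask(result_xml, served_cities):
    """Select the one interpretation the application data supports, else ask the user."""
    cities = candidate_cities(result_xml)
    supported = [c for c in cities if c in served_cities]
    if len(supported) == 1:
        # Reason 1: additional information (here, a set of served cities) settles it.
        return ("selected", supported[0])
    # Reason 2: go back to the user with the competing alternatives.
    options = supported or cities
    quoted = " or ".join(f'"{c}"' for c in options)
    return ("clarify", f"Did you say {quoted}?")


print(choose_or_ask(SAMPLE, {"Boston", "Denver"}))
print(choose_or_ask(SAMPLE, {"Boston", "Austin"}))
```

In the first call the hypothetical database knows only one of the two cities, so a selection is made; in the second call both remain plausible and a clarification prompt is produced.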

The "model" is an XForms data model for the semantic information being returned in 
the interpretation. The "model" is a structured representation of the interpretation and 
allows for type checking. The "instance" is an instantiation of the data model containing 
the semantic information for a specific interpretation of a specific utterance. For 
example, the information in a travel application might include three groups of 
information: flights, car rental and hotels. The flight information, in turn, could contain 
values for "to_city", "from_city", "departure_date" and so on, which would be typed as
strings. 

Side Information: 

Elements and attributes representing additional information about the interpretation, 
over and above the interpretation itself. Side information includes:

1. Whether an interpretation was achieved (the "nomatch" element) and the
system's confidence in an interpretation (the "confidence" attribute of
"interpretation").

2. Alternative interpretations ("interpretation").

Multi-modal integration:

When more than one modality is available for input, the interpretation of the inputs
needs to be coordinated. The "mode" attribute of "input" supports this by indicating
whether the utterance was input by speech, dtmf, pointing, etc. The timestamp 
attributes of "input" also provide for temporal coordination by indicating when inputs 
occurred. 

1.3 Overview of Elements and their Relationships 

This figure shows a graphical view of the relationships among the elements of the 
Natural Language Semantics markup. 
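
[Figure: the "incoming data" box contains the "input" element with the attributes mode,
timestamp-start, timestamp-end and confidence. The "meaning" box contains "result"
(attributes grammar, x-model, xmlns), "interpretation" (attributes confidence, grammar,
x-model, xmlns) and the application-specific elements defined by the XForms data model.]
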

The elements shown in the graphic fall into two categories: 

1. description of the input to be processed; shown in the left box, "incoming data" in 
blue. 

2. description of the meaning which was extracted from the input; shown in the right 
box, "meaning", in yellow. 

Next to each element in the graphic are its attributes in italics. In addition, some 
elements can contain multiple instances of other elements. For example, a "result" can 
contain multiple "interpretations", each of which is taken to be an alternative. The 
element "xf:model" is an XForms data model as specified in the XForms data model
draft, and therefore is not defined in this document. 



To illustrate the basic usage of these elements, consider the utterance "ok"
(interpreted as "yes"). The following example shows how that utterance and its
interpretation would be represented in the NL Semantics markup.



<result x-model="http://theYesNoModel"
        xmlns:xf="http://www.w3.org/2000/xforms"
        grammar="http://theYesNoGrammar">
  <interpretation>
    <xf:instance>
      <myApp:yes_no>
        <response>yes</response>
      </myApp:yes_no>
    </xf:instance>
    <input>ok</input>
  </interpretation>
</result>



This example includes only the minimum required information, i.e., it does not include 
any of the optional information defined in this document. There is an overall "result" 
element which includes one interpretation. The data model is defined externally by 
referring to the URI for "theYesNoModel". This external model defines a "response"
element. The "myApp" namespace refers to the application-specific elements that are 
defined by the XForms data model. 



2. Elements and Attributes 



2.1 "result" Root Element 



Attributes: grammar, x-model, xmlns 



The root element of the markup is "result". The "result" element includes one or more
"interpretation" elements. Multiple interpretations result from ambiguities in the input or
in the semantic interpretation. If the "grammar", "x-model", and "xmlns" attributes don't
apply to all of the interpretations in the result, they can be overridden for individual
interpretations at the "interpretation" level.



Attributes: 



1. grammar: The grammar or recognition rule matched by this result. (The format of
the grammar attribute will match the rule reference semantics defined in the
grammar specification.) The grammar can be overridden by a grammar attribute
in the "interpretation" element if the input was ambiguous as to which grammar it
matched.

2. x-model: The URI which defines the XForms data model used for this result. The
data model used by the interpretation can either be specified here or by an in-line
data model using the "model" element. The x-model can be overridden
by an x-model attribute in the "interpretation" element if the input was ambiguous
as to which x-model it matched. (optional)

3. xmlns: An XML namespace declaration is required to define the namespace
used by XForms elements and attributes. The DTD defaults the "xmlns"
namespace declaration to a standard location, since it will rarely change.



<result grammar="http://grammar" x-model="http://dataModel"
        xmlns:xf="http://www.w3.org/2000/xforms">

</result>



2.2 "interpretation" Element



Attributes: confidence, grammar, x-model, xmlns 

An "interpretation" element contains a single semantic interpretation. 



Attributes: 



1. confidence: An integer from 0-100 indicating the semantic analyzer's confidence
in this interpretation. At this point there is no formal, platform-independent
definition of confidence. (optional)

2. grammar: The grammar or recognition rule matched by this interpretation. The
dialog markup interpreter needs to know the grammar rule that is matched by the
utterance because multiple rules may be simultaneously active. The value that is
filled in is the grammar URI used by the dialog markup interpreter to specify the
grammar. The format of the grammar attribute will match the rule reference
semantics defined in the grammar specification. Specifically, the rule reference
will be in the external XML form for grammar rule references. This attribute is
only needed under "interpretation" if it is necessary to override a grammar that
was defined at the "result" level. (optional)

3. x-model: The location of the XForms data model used for this interpretation. The
XForms data model used by the interpretation may either be specified here or by an
in-line data model using the "model" element. (As in the case of "grammar", this
attribute only needs to be defined under "interpretation" if it is necessary to
override the x-model specification defined at the "result" level.) (optional)

Interpretations must be sorted best-first by some measure of "goodness". The 
goodness measure is "confidence" if present, otherwise, it is some platform-specific 
indication of quality. 
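
A consumer that relies on the best-first ordering may wish to verify it when "confidence" is present. The following Python sketch shows one way to do so; the function names and the two sample documents are illustrative assumptions, and the check says nothing about platform-specific measures of quality.

```python
import xml.etree.ElementTree as ET


def confidences(result_xml):
    """Return each interpretation's confidence in document order (None if absent)."""
    root = ET.fromstring(result_xml)
    values = []
    for interp in root.findall("interpretation"):
        raw = interp.get("confidence")
        values.append(int(raw) if raw is not None else None)
    return values


def is_sorted_best_first(result_xml):
    """Return True/False when every interpretation has a confidence, else None.

    None means the ordering rests on a platform-specific measure that
    cannot be checked from the markup alone.
    """
    values = confidences(result_xml)
    if any(v is None for v in values):
        return None
    return all(a >= b for a, b in zip(values, values[1:]))


# Hypothetical minimal documents used only to exercise the check.
GOOD = '<result><interpretation confidence="60"/><interpretation confidence="40"/></result>'
BAD = '<result><interpretation confidence="40"/><interpretation confidence="60"/></result>'

print(is_sorted_best_first(GOOD))
print(is_sorted_best_first(BAD))
```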



The x-model and grammar are expected to be specified most frequently at the "result"
level, because most often one data model will be sufficient for the entire result.
However, they can be overridden at the "interpretation" level because it is possible that
different interpretations may have different data models, perhaps because they match
different grammar rules.

The "interpretation" element includes an "input" element which contains the input being
analyzed, optionally a "model" element defining the XForms data model and an
"instance" element containing the instantiation of the data model for this utterance. The
data model would be empty if the interpreter was not able to produce any interpretation.



<interpretation confidence="75" grammar="http://grammar"
                x-model="http://dataModel"
                xmlns:xf="http://www.w3.org/2000/xforms">

</interpretation>



2.3 "model" Element 



The "model" element contains an XForms data model for the data and is part of the
XForms namespace. The XForms data model provides for a structured data model
consisting of groups, which may contain other groups or simple types. Simple types can
be one of: string, boolean, number, monetary values, date, time of day, duration, URI,
binary. For further information on XForms data models see the XForms data model
specification. Note that XForms fields default to optional.



If no data model is supplied by either the "model" element or the "x-model" attribute 
then it is assumed that the data model will be provided by the dialog (or whatever other 
process receives the NL semantic mark-up). 

It is an error to specify both an x-model attribute and a "model" element. 
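
The rules above for locating the data model can be sketched in Python as follows. The sketch assumes that an "x-model" attribute on "result" is inherited by an "interpretation" that does not override it, so that combining an inherited "x-model" with an in-line "model" is also treated as an error; that reading, the function name and the sample document are assumptions rather than requirements of this draft.

```python
import xml.etree.ElementTree as ET

XF_MODEL = "{http://www.w3.org/2000/xforms}model"  # the in-line "model" element


def model_source(result, interp):
    """Report where the data model for one interpretation comes from.

    Returns ("inline", element), ("uri", value) or ("dialog", None), and
    raises ValueError when both an x-model and a "model" element apply.
    """
    # Assumption: the attribute on "result" applies unless overridden.
    x_model = interp.get("x-model") or result.get("x-model")
    inline = interp.find(XF_MODEL)
    if x_model is not None and inline is not None:
        raise ValueError("both an x-model attribute and a model element are specified")
    if inline is not None:
        return ("inline", inline)
    if x_model is not None:
        return ("uri", x_model)
    # Neither is present: the model is expected to come from the dialog.
    return ("dialog", None)


# Hypothetical document: the second interpretation conflicts with the result's x-model.
SAMPLE = """
<result xmlns:xf="http://www.w3.org/2000/xforms" x-model="http://dataModel">
  <interpretation confidence="60"/>
  <interpretation confidence="40">
    <xf:model/>
  </interpretation>
</result>
"""

root = ET.fromstring(SAMPLE)
for interp in root.findall("interpretation"):
    try:
        print(model_source(root, interp)[0])
    except ValueError as err:
        print("error:", err)
```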



Example: An XForms data model for name and address. 



<model>
  <xf:group name="nameAddress">
    <string name="name"/>
    <string name="street"/>
    <string name="city"/>
    <string name="state"/>
    <string name="zip">
      <mask>ddddd</mask>
    </string>
  </xf:group>
</model>



2.4 "instance" Element 



The "instance" element contains an instance of the XForms data model for the data and
is part of the XForms namespace.

Attributes:

1. confidence: All elements of the data instance may have an optional confidence 
attribute, defined in the NL semantics namespace. The confidence attribute 
contains an integer value in the range from 0-100 reflecting the system's 
confidence in the analysis of that slot. The meaning of confidence scores has not 
been defined in a platform-independent way. (optional) 



The use of a confidence attribute from the NL semantics namespace does not appear
to present any document validation problems. However, if future XForms specifications
support an equivalent attribute, that would be preferable to the current proposal.




2.5 "input" Element 



The "input" element is the text representation of a user's input. It includes an optional 
"confidence" attribute which indicates the recognizer's confidence in the recognition 
result (not the confidence in the interpretation, which is indicated by the "confidence" 
attribute of "interpretation"). Optional "timestamp-start" and "timestamp-end" attributes 
indicate the start and end times of a spoken utterance, in ISO 8601 format
(http://www.iso.ch/markete/8601.pdf).

Attributes: 

1. timestamp-start: The time at which the input began. (optional)

2. timestamp-end: The time at which the input ended. (optional)

3. mode: The modality of the input, for example, speech, dtmf, etc. (optional)

4. confidence: The confidence of the recognizer in the correctness of the input.
(optional)

Note that it doesn't make sense for temporally overlapping inputs to have the same
mode; however, this constraint is not expected to be enforced by platforms.

When there is no time zone designator, ISO 8601 time representations default to local 
time. 
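
Although platforms are not expected to enforce the constraint on overlapping inputs, a consumer could check it. The Python sketch below is one possible approach under stated assumptions: the timestamps in the hypothetical sample are written with zero-padded fields so that the standard library can parse them, and the function names are illustrative. Timestamps without a time zone designator parse as naive values, which corresponds to the local-time default described above.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical nested input; timestamp values are zero-padded for parsing.
SAMPLE = """
<input>
  <input mode="speech" timestamp-start="2000-04-03T00:00:00.000"
         timestamp-end="2000-04-03T00:00:00.200">fried</input>
  <input mode="speech" timestamp-start="2000-04-03T00:00:00.250"
         timestamp-end="2000-04-03T00:00:00.600">onions</input>
</input>
"""


def timed_inputs(outer):
    """Return (mode, start, end, text) for each child input that has both timestamps."""
    spans = []
    for child in outer.findall("input"):
        start = child.get("timestamp-start")
        end = child.get("timestamp-end")
        if start is None or end is None:
            continue  # both attributes are optional
        spans.append((child.get("mode"),
                      datetime.fromisoformat(start),
                      datetime.fromisoformat(end),
                      (child.text or "").strip()))
    return spans


def same_mode_overlaps(spans):
    """Return pairs of input texts that share a mode and overlap in time."""
    clashes = []
    for i, (mode_a, start_a, end_a, text_a) in enumerate(spans):
        for mode_b, start_b, end_b, text_b in spans[i + 1:]:
            if mode_a == mode_b and start_a < end_b and start_b < end_a:
                clashes.append((text_a, text_b))
    return clashes


spans = timed_inputs(ET.fromstring(SAMPLE))
print(same_mode_overlaps(spans))
```

For the sample, the two words do not overlap, so no pairs are reported.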

There are three possible formats for the "input" element.

a) The "input" element can contain simple text:



<input confidence="100" mode="speech">onions</input>



b) The "input" element can also contain additional "input" elements. Having additional 
input elements allows the representation to support future multi-modal inputs as well as 
finer-grained speech information, such as timestamps for individual words and word- 
level confidences. 



<input>
  <input mode="speech"
         timestamp-start="2000-..."
         timestamp-end="2000-04-03T0:00:00.2">fried</input>
  <input mode="speech" confidence="100"
         timestamp-start="2000-04-03T0:00:00.25"
         timestamp-end="2000-04-03T0:00:00.6">onions</input>
</input>





c) Finally, the "input" element can contain "nomatch" and "noinput" elements, which 
describe situations in which the speech recognizer (or other media interpreter) received 
input that it was unable to process, or did not receive any input at all, respectively. 
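
The three formats can be told apart mechanically. The following Python sketch classifies an "input" element and walks nested inputs down to their leaves; the function names are illustrative assumptions, and the sample is a hypothetical rendering of garbled speech combined with DTMF digits.

```python
import xml.etree.ElementTree as ET


def input_form(elem):
    """Classify one "input" element as "noinput", "nomatch", "nested" or "text"."""
    if elem.find("noinput") is not None:
        return "noinput"
    if elem.find("nomatch") is not None:
        return "nomatch"
    if elem.find("input") is not None:
        return "nested"
    return "text"


def leaf_forms(elem):
    """Return (mode, form) for every innermost input beneath elem."""
    if input_form(elem) == "nested":
        leaves = []
        for child in elem.findall("input"):
            leaves.extend(leaf_forms(child))
        return leaves
    return [(elem.get("mode"), input_form(elem))]


# Hypothetical sample: unrecognized speech together with DTMF digits.
SAMPLE = """
<input>
  <input mode="speech"><nomatch/></input>
  <input mode="dtmf">1 2 3 4</input>
</input>
"""

print(leaf_forms(ET.fromstring(SAMPLE)))
```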

2.6 "nomatch" Element 

The "nomatch" element under "input" is used to indicate that the natural language 
interpreter was unable to successfully match any input. It can optionally contain the text 
of the best of the (rejected) matches. 









<interpretation>
  <instance/>
  <input>
    <nomatch/>
  </input>
</interpretation>



2.7 "noinput" Element



The "noinput" element under "input" is used to indicate that there was no input, that is,
that a timeout occurred in the speech recognizer due to silence.



<interpretation>
  <instance/>
  <input>
    <noinput/>
  </input>
</interpretation>



If there are multiple levels of inputs, it appears that the most natural place for the
"noinput" element is under the highest level of "input", and the most natural place for
"nomatch" is under the appropriate level of "input". So "noinput" means "no input
at all" and "nomatch" means "no match in speech modality" or "no match in dtmf
modality". For example, to represent garbled speech combined with dtmf "1 2 3 4", we
would have the following:



<input>
  <input mode="speech"><nomatch/></input>
  <input mode="dtmf">1 2 3 4</input>
</input>



2.8 Interpreting Meta-Dialog and Meta-Task Utterances 

The natural language requirements state that the semantics specification must be 
capable of representing a number of types of meta-dialog and meta-task utterances. 
This specification is flexible enough so that meta utterances can be represented on an 
application-specific basis without defining specific formats in this specification. 

Here are two examples of how meta-task and meta-dialog utterances might be 
represented. 



System: What toppings do you want on your pizza? 
User: What toppings do you have? 



<interpretation grammar="http://toppings"
                xmlns:xf="http://www.w3.org/2000/xforms">
  <input mode="speech">
    what toppings do you have?
  </input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <xf:question>
      <xf:questioned_item>toppings</xf:questioned_item>
      <xf:questioned_property>
        availability
      </xf:questioned_property>
    </xf:question>
  </xf:instance>
</interpretation>



User: slow down. 



<interpretation grammar="http://generalCommandsGrammar"
                xmlns:xf="http://www.w3.org/2000/xforms">
  <xf:model>
    <group name="command">
      <string name="action"/>
      <string name="doer"/>
    </group>
  </xf:model>
  <xf:instance>
    <myApp:command>
      <action>reduce speech rate</action>
      <doer>system</doer>
    </myApp:command>
  </xf:instance>
  <input mode="speech">slow down</input>
</interpretation>



2.9 Anaphora and Deixis 

This specification can be used on an application-specific basis to represent utterances 
that contain unresolved anaphoric and deictic references. Anaphoric references, which 
include pronouns and definite noun phrases that refer to something that was mentioned 
in the preceding linguistic context, and deictic references, which refer to something that 
is present in the non-linguistic context, present similar problems in that there may not 
be sufficient unambiguous linguistic context to determine what their exact place in the
data instance should be. In order to represent unresolved anaphora and deixis using
this specification, the developer must define a more surface-oriented representation
that leaves the interpretation of the reference open. (This assumes that a later
component is responsible for actually resolving the reference.)

Example: (ignoring the issue of representing the input from the pointing gesture.) 



System: What do you want to drink? 

User: I want this (clicks on picture of large root beer.)




3. Extensibility 



One of the natural language requirements states that the specification must be
extensible. The specification supports this requirement through its flexibility, as
shown in the discussions of meta utterances and anaphora. The markup can easily
be used in sophisticated systems to convey application-specific information that more
basic systems would not make use of, for example defining speech acts, if this is
meaningful to the dialog manager. Standard representations for items such as
dates and times could also be defined.

4. Compliance 

Compliance issues are deferred until a later revision of the specification. 

5. Document Type Definition

(TBD) 

Leading and trailing spaces in utterances are not significant. This will be defined in the 
DTD by specifying "xml:space=default". 

6. Examples 

6.1 Simple Ambiguity: 

System: To which city will you be traveling? 
User: I want to go to Pittsburgh.



<result xmlns:xf="http://www.w3.org/2000/xforms"
        grammar="http://flight">
  <interpretation confidence="60">
    <input mode="speech">
      I want to go to Pittsburgh
    </input>
    <xf:model>
      <group name="airline">
        <string name="to_city"/>
      </group>
    </xf:model>
    <xf:instance>
      <myApp:airline>
        <to_city>Pittsburgh</to_city>
      </myApp:airline>
    </xf:instance>
  </interpretation>
  <interpretation confidence="40">
    <input>I want to go to Stockholm</input>
    <xf:model>
      <group name="airline">
        <string name="to_city"/>
      </group>
    </xf:model>
    <xf:instance>
      <myApp:airline>
        <to_city>Stockholm</to_city>
      </myApp:airline>
    </xf:instance>
  </interpretation>
</result>



6.2 Mixed Initiative: 

System: What would you like? 

User: I would like 2 pizzas, one with pepperoni and cheese, one with sausage
and a bottle of coke, to go.

This representation includes an order object which in turn contains objects named
"food_item", "drink_item" and "delivery_method". This representation assumes there
are no ambiguities in the speech or natural language processing. Note that this
representation also assumes some level of intrasentential anaphora resolution, i.e., to
resolve the two "one's" as "pizza".



<result xmlns:xf="http://www.w3.org/2000/xforms"
        grammar="...">
  <interpretation confidence="100">
    <xf:model>
      <group name="order">
        <group name="food_item" maxOccurs="*">
          <group name="pizza">
            <string name="ingredients" maxOccurs="*"/>
          </group>
          <group name="burger">
            <string name="..." maxOccurs="*"/>
          </group>
        </group>
        <group name="drink_item" maxOccurs="*">
          <string name="size"/>
          <string name="type"/>
        </group>
        <string name="delivery_method"/>
      </group>
    </xf:model>
    <xf:instance>
      <myApp:order>
        <food_item confidence="100">
          <pizza>
            <xf:ingredients confidence="100">
              pepperoni
            </xf:ingredients>
            <xf:ingredients confidence="100">
              cheese
            </xf:ingredients>
          </pizza>
          <pizza>
            <ingredients>sausage</ingredients>
          </pizza>
        </food_item>
        <drink_item confidence="100">
          <size>2-liter</size>
        </drink_item>
        <delivery_method>to go</delivery_method>
      </myApp:order>
    </xf:instance>
    <input mode="speech">I would like 2 pizzas,
      one with pepperoni and cheese, one with sausage
      and a bottle of coke, to go.
    </input>
  </interpretation>
</result>



6.3 DTMF: 



A combination of dtmf input and speech would be represented using nested input 
elements. For example: 



User: My pin is (dtmf 1 2 3 4) 


<input>
  <input mode="speech" confidence="100"
         timestamp-start="2000-04-03T0:00:..."
         timestamp-end="2000-...">My pin is
  </input>
  <input mode="dtmf" confidence="100"
         timestamp-start="2000-04-03T0:00:01..."
         timestamp-end="2000-04-03T0:00:02.0">1 2 3 4
  </input>
</input>



7. Future Study 

7.1 Representation of ambiguities

In this mark-up ambiguities are only represented at the top level, using separate
interpretation elements. Representation of "local" ambiguities, for example, at the level
of an ambiguity between two ingredients (peppers vs. pepperoni) would be useful, but
presents validation problems because of multiple namespaces unless the XForms
specification includes it. The more compact representation using local ambiguities has
not been defined for three reasons:

1. It is not possible to combine ambiguities with the XForms notation and retain the 
ability to validate NL semantics documents using XML schema or DTDs. 

2. When multiple filler elements are allowed, as for example with pizza toppings, 
representation of ambiguity can become very complex and confusing. 

3. Although fully spelling out ambiguities at the top level results in a more verbose 
representation, current practical systems seldom make use of more than 2 
alternative interpretations, so the increase in verbosity from spelling out 
redundant information should not be too significant in practice. 

Local ambiguities may be supported in the future if representation of ambiguity 
becomes part of the XForms standard. 

7.2 Representing the source of an ambiguity 

If there is more than one interpretation, it may be useful to add an attribute specifying
the source of the ambiguity, for example, "natural_language", "speech", "ocr", or
"handwriting". Speech ambiguities originate in uncertainties about the speech
recognition result, for example, Austin vs. Boston. "handwriting" and "ocr" are
analogous to speech. Natural language ambiguities result from syntactic, semantic, or
pragmatic ambiguities in a single recognizer result. For example, in I want fried onions
and peppers, there are two interpretations, one in which the peppers are to be fried and
one in which they are not to be fried. This attribute would not be meaningful if there is
only one interpretation. This information could be used, for example, by a dialog
manager to construct a more helpful response (e.g. I didn't hear that vs. I didn't
understand that) or by a scoring algorithm that treats different ambiguity sources
differently.

7.3 Representing information collected over the course of a dialog 

In many cases identical information can be conveyed in one utterance or over the
course of several dialog turns. This situation can occur either in the case of a subdialog
or in the case of a reusable component. For example, if the system's goal in the
subdialog or the reusable component is to collect travel information from a user, the
ultimate information is the same whether the user says I want to go from Pittsburgh to
Seattle on January 1, 2001, in a single utterance or whether the same information is
elicited from the user during several dialog turns, as in

System: Where will you be departing from? 
User: Pittsburgh. 
System: Where will you be traveling to?
User: Seattle.

etc.

It should be possible to use a substantially similar semantic representation in both of 
these situations. The main issue is that in the case of information collected over the 
course of a dialog it becomes very difficult to tie that information back to the original 
inputs. Elements such as "input" and attributes such as "timestamp-start",
"timestamp-end", "grammar", and "mode" which relate the semantic interpretation directly to the
input become less meaningful when the information is collected in a dialog. Moreover, 
they also become less useful to the main dialog component, since presumably it's the 
function of the subdialog or reusable component to make use of this low-level 
information internally to guide its own dialog and to shield the main dialog from these 
details. One strategy under consideration is simply to omit these aspects of the markup 
for dialog-based semantic information. This issue may also be dealt with in the reusable 
components group, since the issue of return information is key to its charter. 
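
The strategy of omitting input-level markup can be sketched in Python as follows. This
is only an illustration under stated assumptions: results are modeled as plain
dictionaries, the slot names ("origin", "destination", "date") and the date format are
hypothetical, and the third dialog turn is assumed since the dialog above is
abbreviated.

```python
# Keys that tie an interpretation to one specific input, as listed in the text.
INPUT_LEVEL_KEYS = {"input", "timestamp-start", "timestamp-end", "grammar", "mode"}


def merge_turns(turns):
    """Combine per-turn results into one result, dropping input-level details."""
    merged = {}
    for turn in turns:
        for key, value in turn.items():
            if key not in INPUT_LEVEL_KEYS:
                merged[key] = value
    return merged


single = [{
    "origin": "Pittsburgh", "destination": "Seattle", "date": "2001-01-01",
    "input": "I want to go from Pittsburgh to Seattle on January 1, 2001",
    "mode": "speech",
}]
several = [
    {"origin": "Pittsburgh", "input": "Pittsburgh", "mode": "speech"},
    {"destination": "Seattle", "input": "Seattle", "mode": "speech"},
    {"date": "2001-01-01", "input": "January 1, 2001", "mode": "speech"},
]
print(merge_turns(single) == merge_turns(several))
```

Once the input-level keys are dropped, the single-utterance result and the multi-turn
result compare equal, which is the property a shared representation would need.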

7.4 Composition of multiple data models within one utterance 

Some utterances could potentially make use of more than one data model in their 
semantic representations. For example it is possible in a mixed initiative situation for the 
user to combine multiple functions in one utterance, as in: 

System: I heard you say you want to go to Pittsburgh, is that correct?

User: Yes, and I'll be leaving around 8:00 a.m. 

It would be natural for there to be a generic data model for the "yes" and also an 
application-specific model for the flight arrangements. One possibility would be for the 
interpreter to create one joint data model on the fly from these models. Or, the 
developer could define one data model that includes both elements for "yes_no" and for 
the application-specific information. If there are two data models, and consequently two 
instances, then it is necessary to consider the problem of associating the instances with 
the correct data models. 
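
One way to picture the first option, a joint data model created on the fly, is the
following Python sketch. It keys each instance by the name of its data model so that
the association between instance and model is preserved. The model names, field
names, and the dictionary representation are assumptions made for illustration; this
specification does not define such a mechanism.

```python
def joint_instance(instances):
    """Combine (model_name, instance) pairs into one instance keyed by model name."""
    joint = {}
    for model_name, instance in instances:
        if model_name in joint:
            raise ValueError(f"duplicate data model: {model_name}")
        joint[model_name] = dict(instance)
    return joint


# Hypothetical interpretation of "Yes, and I'll be leaving around 8:00 a.m."
result = joint_instance([
    ("yes_no", {"response": "yes"}),
    ("flight", {"departure_time": "08:00"}),
])
print(result)
```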

7.5 Representation of Multi-modal input 

This issue is deferred until the specification for multi-modal input is better defined, except for
DTMF (for DTMF, see the example above).

7.6 Extensibility of XForms data models 

It would be highly desirable if components in the dialog system could extend the data 
model so that grammars or reusable components could return information that is
additional to a base data model for, say, a time or date component or grammar. With 
the current XForms specification it would be necessary to provide a complete new data 
model in these cases. It is possible that the XForms working group may extend the 
XForms specification to include extensibility of the data model. 

Similarly, the current XForms data model definition does not provide for the re-use of 
complex type definitions, i.e. groups, in multiple locations. Thus, to represent travel 
information consisting of both an outbound flight and an inbound flight, it is not possible 
to define a single complex type "flight_details" that is used for both outbound and 
inbound flight information. (See the section on "Shared Datatype Libraries" in the 
XForms Data Model document for additional discussion.) 

7.7 Representation of recursive structures 

Some systems may find it useful to represent generic syntactic parse trees in natural 
language output. Generic parse trees cannot be represented by current XForms data 
models because they do not support any recursion. However, it is not clear how 
frequently this capability would be required. 

7.8 Representing unanalyzed information: "unanalyzed" Element 

An "unanalyzed" element could be used to represent a part of the input that was left 
unanalyzed in the current interpretation. This element could be used by a dialog 
manager to decide if enough of the input had been analyzed for the dialog to proceed, 
or if the dialog manager should ask for a clarification from the user. The dialog manager 
could also use the unanalyzed material to help it decide which of several alternative 
interpretations is correct. Each "unanalyzed" element would contain "input" elements 
which would contain the portions of the full utterance that were unanalyzed.

"unanalyzed" has not been included in the current version of the spec for several 
reasons: 

1. It's not clear that it has a platform-independent interpretation.

2. It's not clear that current applications would make use of it. 

3. Although there is a requirement for representing "unanalyzed", this can be 
accommodated in the current specification if the developer incorporates 
"unanalyzed" into the data model in an application-specific manner. In addition, 
natural language interpreters can take unanalyzed information into account 
internally when they are computing confidences, so that this information is 
available indirectly to dialog managers through the confidence attributes. 

The most important consideration appears to be whether in fact the ability to represent
unanalyzed material is of interest to current or near future applications. 

Note that "unanalyzed" would be useful mainly for systems with robust
natural language interpreters that are capable of ignoring portions of the speech
recognizer result that don't match the natural language grammar. In tightly
coupled ASR/NL systems, which require that all of the input match a speech recognizer
grammar, the notion of "unanalyzed" isn't useful, since all of the input must be
analyzed by the nature of the system. Similarly, keyword spotting systems with garbage
models will not be able to make use of this element because the speech recognition
process discards any unrecognizable speech before the natural language interpretation
process begins.

Example: 

System: Where do you want to go? 

User: I'd like to fly from Boston and then continue on to Philadelphia. 



(assuming that "and then continue on" is not included in the speech grammar.) 




If there is duplicated unanalyzed material, as in "Please get my email please", every
unanalyzed item should be represented individually, so "please" should be duplicated if
both occurrences are unanalyzed.
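
The rule for duplicated material can be sketched in Python as follows. This is a
minimal illustration, assuming that the interpreter can report which word positions
it analyzed; the function name and that interface are hypothetical and are not part
of this specification. Each contiguous run of unanalyzed words becomes a separate
item, so repeated words are kept rather than merged.

```python
def unanalyzed_spans(words, analyzed_indices):
    """Return each contiguous run of unanalyzed words as a separate item."""
    spans, current = [], []
    for index, word in enumerate(words):
        if index in analyzed_indices:
            if current:
                spans.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        spans.append(" ".join(current))
    return spans


# Assume only "get my email" (positions 1 to 3) was analyzed.
words = "Please get my email please".split()
print(unanalyzed_spans(words, {1, 2, 3}))
```

Applied to the flight example above, with "and then continue on" as the only words
outside the grammar, the same function would return that phrase as a single item.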